High Performance Cholesky Factorization via Blocking and Recursion That Uses Minimal Storage
Authors
Abstract
We present a high-performance Cholesky factorization algorithm, called BPC for Blocked Packed Cholesky, which performs better than or equivalent to the LAPACK DPOTRF subroutine, but with about the same memory requirements as the LAPACK DPPTRF subroutine, which runs at level 2 BLAS speed. Algorithm BPC only calls DGEMM and level 3 kernel routines. It combines a recursive algorithm with blocking and a recursive packed data format. A full analysis of overcoming the non-linear addressing overhead imposed by recursion is given and discussed. Finally, since BPC uses GEMM to a great extent, we easily get a considerable amount of SMP parallelism from an SMP GEMM. In [1], a new recursive packed format for Cholesky factorization was described. It is a variant of the triangular format described in [5] and suggested in [4]. For a reference to recursion in linear algebra, see [4, 5]. The advantage of using packed formats instead of full format is the memory savings possible when working with large matrices. The disadvantage is that the use of high-performance standard library routines, such as the matrix multiply-and-add operation GEMM, is inhibited. We combine blocking with recursion to produce a practical implementation of a new algorithm called BPC, for Blocked Packed Cholesky [3, pages 142-147]. BPC first transforms standard packed lower format to packed recursive lower row format. Then, during execution, algorithm BPC only calls standard DGEMM and level 3 kernel routines. This method has the benefit of being transparent to the user: no extra assumptions about the input matrix need be made. This is important since an algorithm like Cholesky using a new data format would not work with existing codes. Algorithm BPC outperforms the level 3 LAPACK [2] routine DPOTRF on three IBM platforms, the POWER3, the POWER2, and the PowerPC 604e, even when ...
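To illustrate the blocking idea the abstract describes, the following is a minimal sketch (not the authors' BPC code) of a blocked Cholesky factorization in which the triangular solve and the trailing-submatrix update are the TRSM- and GEMM/SYRK-like level 3 operations where a blocked code spends most of its time. The function name `blocked_cholesky`, the block size `nb`, and the use of NumPy in place of BLAS calls are illustrative assumptions.

```python
import numpy as np

def blocked_cholesky(A, nb=2):
    """Lower-triangular Cholesky factor of an SPD matrix A, block size nb.

    Illustrative sketch only: the small diagonal block is factored with
    level-2-style work, while the bulk of the flops land in matrix-matrix
    operations (the TRSM-like solve and the GEMM/SYRK-like update).
    """
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # Factor the small diagonal block A11 = L11 * L11^T.
        A[k:k+kb, k:k+kb] = np.linalg.cholesky(A[k:k+kb, k:k+kb])
        if k + kb < n:
            L11 = A[k:k+kb, k:k+kb]
            # TRSM-like step: L21 = A21 * L11^{-T}.
            A[k+kb:, k:k+kb] = np.linalg.solve(L11, A[k+kb:, k:k+kb].T).T
            L21 = A[k+kb:, k:k+kb]
            # GEMM/SYRK-like trailing update: A22 <- A22 - L21 * L21^T.
            A[k+kb:, k+kb:] -= L21 @ L21.T
    return np.tril(A)
```

Because almost all the arithmetic is in the trailing update, replacing that single line with a tuned (or SMP-parallel) GEMM is what gives a blocked code its level 3 performance.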
Similar references
Parallel and fully recursive multifrontal sparse Cholesky
We describe the design, implementation, and performance of a new parallel sparse Cholesky factorization code. The code uses a multifrontal factorization strategy. Operations on small dense submatrices are performed using new dense matrix subroutines that are part of the code, although the code can also use the BLAS and LAPACK. The new code is recursive at both the sparse and the dense levels, i...
Optimizing Locality of Reference in Cholesky Algorithms
This paper presents the principal ideas involved in hierarchical blocking, introduces the block packed storage scheme, and gives the implementation details and the performance rates of the hierarchically blocked Cholesky factorization. In some cases the newly developed routines are faster by an order of magnitude than the corresponding LAPACK routines. Introduction: Most current computers based ...
LAPACK Cholesky Routines in Rectangular Full Packed Format
We describe a new data format for storing triangular and symmetric matrices called RFP (Rectangular Full Packed). The standard two dimensional arrays of Fortran and C (also known as full format) that are used to store triangular and symmetric matrices waste half the storage space but provide high performance via the use of level 3 BLAS. Packed format arrays fully utilize storage (array space) b...
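As background for the storage trade-off this entry describes, a minimal sketch of classical (non-recursive, non-RFP) lower packed storage follows: the columns of the lower triangle are stored consecutively, so an n-by-n symmetric matrix needs only n(n+1)/2 entries instead of n². The helper `packed_index` is an illustrative function, not from the paper, and assumes 0-based, column-major lower packed layout.

```python
def packed_index(i, j, n):
    """0-based offset of element (i, j), with i >= j, in lower packed
    column-major storage of an n-by-n symmetric matrix.

    Column j starts after the j previous columns, which hold
    (n) + (n-1) + ... + (n-j+1) = j*(2n - j + 1)/2 entries in total;
    within column j, element (i, j) sits at offset i - j.
    """
    assert 0 <= j <= i < n
    return i + j * (2 * n - j - 1) // 2
```

For n = 3 this walks the lower triangle column by column: (0,0)→0, (1,0)→1, (2,0)→2, (1,1)→3, (2,1)→4, (2,2)→5. The non-linear addressing is exactly what makes it hard to hand such an array to a level 3 BLAS routine, which motivates formats like RFP.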
Parallel and Fully Recursive Multifrontal Supernodal Sparse Cholesky
We describe the design, implementation, and performance of a new parallel sparse Cholesky factorization code. The code uses a supernodal multifrontal factorization strategy. Operations on small dense submatrices are performed using new dense-matrix subroutines that are part of the code, although the code can also use the BLAS and LAPACK. The new code is recursive at both the sparse and the dens...
Block Sparse Cholesky Algorithms on Advanced Uniprocessor Computers
As with many other linear algebra algorithms, devising a portable implementation of sparse Cholesky factorization that performs well on the broad range of computer architectures currently available is a formidable challenge. Even after limiting our attention to machines with only one processor, as we have done in this report, there are still several interesting issues to consider. For dense ma...
Publication date: 2000